Learning Objective: Learn how to generate and answer questions about data using data visualization and basic data transformations.
from altair import *
import numpy as np
cars = load_dataset('cars')
cars.head()
One of the early champions of Exploratory Data Analysis, or EDA, was John W. Tukey, who wrote a book by the same name in 1977. He defined data analysis as:
Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.
A nice modern take on EDA is provided by Garrett Grolemund and Hadley Wickham in their R for Data Science book. Grolemund and Wickham clarify that EDA is an iterative cycle that consists of:
In other words, EDA is all about exploring your data by generating and answering questions. This section of the course can be viewed as Python-based lecture notes for the content in R for Data Science.
Read through the EDA section of Grolemund and Wickham.
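They distill the core questions of EDA down to the following two: what type of variation occurs within my variables, and what type of covariation occurs between my variables?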
Note that this section assumes a tidy dataset whose columns are variables and rows are observations/samples.
The first step in EDA is to explore individual variables separately. The following things are useful in exploring a single variable:
These things enable you to understand the variation within a single variable. From R4DS:
Variation is the tendency of the values of a variable to change from measurement to measurement.
We begin with the cars dataset and a single quantitative variable, the Acceleration. First let's compute summary statistics by calling the describe method on that column:
cars['Acceleration'].describe()
Next, let's study the variation in this variable by visualizing the distributions of its observations. We begin with a tick chart:
Chart(cars).mark_tick().encode(
x='Acceleration'
)
We see that most values lie near the mean of $15.5$, but that there is considerable variation about this value. Some questions we might ask:
A tick chart is a great way of understanding the variation in a single variable. However, in dense regions of the distribution it is difficult to see exactly how many samples are present. In many cases a better visualization for a single quantitative variable is a histogram. Here is a histogram of the Acceleration:
Chart(cars).mark_bar().encode(
X('Acceleration:Q', bin=Bin(maxbins=30)),
Y('count(*)')
)
The histogram is an invaluable visual tool for understanding a quantitative variable. It will immediately give you a sense of the variation and distribution of a single variable.
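To make concrete what a histogram displays, the bin counts can also be computed directly with NumPy. The following is a minimal sketch using a handful of made-up acceleration values (not taken from the cars dataset); the variable names and the choice of six bins are illustrative assumptions.

```python
import numpy as np

# Hypothetical acceleration values, made up for illustration
# (not taken from the cars dataset).
values = np.array([8.5, 12.0, 13.5, 14.0, 15.0, 15.5, 16.0, 16.5, 19.0, 24.5])

# Split the range 8 to 26 into 6 equal-width bins and count the samples in each.
# Note that np.histogram includes the right edge in the last bin only.
counts, edges = np.histogram(values, bins=6, range=(8, 26))

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:4.1f} to {right:4.1f}: {n}")
```

Each bar in a histogram corresponds to one of these counts. Changing the number of bins changes how much detail is visible, which is why it is worth trying a few values of maxbins when exploring.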
To show how important this type of exploration is, let's look at the IMDB and Rotten Tomatoes ratings from the movies dataset. IMDB and Rotten Tomatoes are two websites that allow users to rate movies. You might hypothesize that users would give similar ratings on these two websites. Let's see what the data says:
movies = load_dataset('movies')
movies.head()
Here is a histogram of the IMDB ratings:
Chart(movies[['IMDB_Rating']]).mark_bar().encode(
X('IMDB_Rating:Q', bin=Bin(maxbins=30)),
Y('count(*)')
)
Note the extremely smooth distribution with a well-defined peak. Now the Rotten Tomatoes ratings:
Chart(movies[['Rotten_Tomatoes_Rating']]).mark_bar().encode(
X('Rotten_Tomatoes_Rating:Q', bin=Bin(maxbins=30)),
Y('count(*)')
)
Whoa! The distribution isn't smooth and doesn't have a well-defined peak. These two ratings distributions show different patterns of variation. So much so that we are prompted to begin asking questions:
Obviously, these histograms alone don't contain sufficient information to answer these questions. But maybe the rest of the data does...
The tick chart and histogram work only for quantitative variables. For categorical (ordinal, nominal) variables, different approaches are needed. First, note that the .describe() method gives us different information for categorical columns:
cars['Origin'].describe()
Another useful method is .unique(), which lists the unique values of the categorical variable:
cars['Origin'].unique()
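A related pandas method, .value_counts(), tabulates how many times each value occurs. Here is a minimal sketch on a small made-up column (the values are illustrative, not the actual cars data):

```python
import pandas as pd

# Hypothetical categorical column, made up for illustration.
origin = pd.Series(['USA', 'Japan', 'USA', 'Europe', 'USA', 'Japan'], name='Origin')

# Count how often each category occurs, most frequent first.
counts = origin.value_counts()
print(counts)

# The same counts expressed as fractions of the total.
fractions = origin.value_counts(normalize=True)
print(fractions)
```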
For a categorical variable, a bar chart provides much the same information that a histogram provides for a quantitative variable:
Chart(cars[['Origin']], width=400, height=100).mark_bar().encode(
X('Origin:N', sort=SortField(field='Origin', op='count', order='descending')),
Y('count(*)', axis=Axis(ticks=5))
)
This chart leads to other questions:
When exploring a single variable, it is important to identify and understand missing or unusual values. Let's look at the distribution of movie ratings:
Chart(movies[['MPAA_Rating']], width=400, height=100).mark_bar().encode(
X('MPAA_Rating:N', sort=SortField(field='MPAA_Rating', op='count', order='descending')),
Y('count(*)', axis=Axis(ticks=5))
)
Notice that there is both a Not Rated value and a null value. This probably indicates that some movies were not rated, while others had a rating that was never recorded in this dataset. The null values are an example of missing values. Let's see how these values are encoded in Python:
movies['MPAA_Rating'].unique()
It looks like the missing values are encoded with Python's None value. We will look more at missing values later in the course. During the EDA process, it is important to identify missing values and understand their significance in the dataset.
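As a quick way of quantifying missing values, pandas can count them directly. The sketch below uses a small made-up column (not the movies data) in which missing entries are encoded as None, mirroring the encoding observed above:

```python
import pandas as pd

# Hypothetical ratings column with missing entries encoded as None.
ratings = pd.Series(['PG', None, 'R', 'Not Rated', None, 'PG-13'], name='MPAA_Rating')

# Number of missing values in the column.
n_missing = ratings.isnull().sum()
print(n_missing)

# By default value_counts() drops missing values; dropna=False
# keeps them as their own category.
print(ratings.value_counts(dropna=False))
```

Note that an explicit category such as Not Rated is an ordinary value as far as pandas is concerned; only the None entries are counted as missing.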
It is also important to understand unusual values. As an example of unusual values, let's look at the Production_Budget variable of the movies dataset:
Chart(movies[['Production_Budget']]).mark_tick().encode(
X('Production_Budget', scale=Scale(type='log'))
)
Viewing this variable on the log scale as ticks shows that there are a number of movies with extremely low production budgets (as low as $200!). Interestingly, this variation is more difficult to see in a histogram. Again, this brings up questions:
When exploring a single variable, you are primarily concerned with understanding the variation within that variable. With two or more variables you can explore relationships or covariation between the variables. From R4DS:
Covariation is the tendency for the values of two or more variables to vary together in a related way.
We begin by exploring the covariation between two quantitative variables. The most common chart type to use for this type of exploration is the two-dimensional scatter chart. In Altair, this amounts to a point mark type, with the two quantitative variables encoded with the x and y position.
Again, EDA is driven by asking and answering questions. Let's look again at the cars dataset and ask the following question:
What other variables affect fuel efficiency (MPG) and how?
First, you might hypothesize that higher-horsepower cars have lower MPG. Let's try to answer that by visualizing these two variables:
Chart(cars).mark_point().encode(
x='Horsepower',
y='Miles_per_Gallon'
)
Indeed, as expected, we observe a downward trend in MPG as the horsepower increases. Are there other variables that could affect the MPG? Maybe lighter cars get better MPG?
Chart(cars).mark_point(opacity=0.2).encode(
x='Weight_in_lbs',
y='Miles_per_Gallon'
)
Yes! What about displacement? The displacement of the engine is a measure of the internal volume of the combined cylinders and is expected to be directly related to the amount of fuel that is burned on each firing of the engine.
Chart(cars).mark_point().encode(
x='Displacement',
y='Miles_per_Gallon'
)
Again, we see a trend that matches our expectations. These trends also suggest other questions. What causes cars to have higher horsepower? Wouldn't larger displacement engines have more horsepower? Let's look at those two variables:
Chart(cars).mark_point().encode(
y='Horsepower',
x='Displacement',
)
All of this suggests the following picture: big cars with big engines get worse gas mileage. One interesting observation about this chart is that there are a number of cars with much higher horsepower than other cars with similar displacement. What is special about those cars? Maybe those engines are turbocharged?
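The strength of these pairwise trends can also be summarized numerically. As a rough sketch, the pandas .corr() method computes Pearson correlation coefficients between columns; the small DataFrame below is made up for illustration and only mimics the column names of the cars dataset:

```python
import pandas as pd

# Hypothetical measurements, made up for illustration (not the cars data).
df = pd.DataFrame({
    'Horsepower': [60, 90, 110, 150, 200],
    'Miles_per_Gallon': [38, 30, 25, 18, 13],
    'Displacement': [90, 120, 200, 300, 400],
})

# Pairwise Pearson correlation coefficients between the columns.
corr = df.corr()
print(corr.round(2))
```

A coefficient near -1 or +1 indicates a strong linear trend. It says nothing about causation and can miss nonlinear relationships, so it complements the scatter chart rather than replacing it.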
The second case of covariation is between one quantitative and one categorical variable. Remember, for a single quantitative variable, we have seen that the tick chart and histogram give a picture of the variable's distribution. To see how a quantitative variable covaries with a categorical variable, one option is to use a tick chart and encode the categorical variable using row, y or color. Let's see how the horsepower varies with the number of cylinders:
Chart(cars).mark_tick().encode(
X('Horsepower'),
Color('Cylinders:N')
)
Here we see that engines with more cylinders, such as 8-cylinder engines, generate more horsepower (as expected). However, a single tick chart with cylinders encoded using color makes it difficult to see the individual distributions. Let's encode cylinders using row instead:
Chart(cars).mark_tick().encode(
X('Horsepower'),
Row('Cylinders:O')
)
Much better! Now we can see the clear differences in the horsepower distributions grouped by cylinders. As expected, there is an overall trend for more cylinders to generate more horsepower, but there is a surprising amount of variation.
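The same grouping can be expressed as a basic data transformation. The following sketch, which assumes a small made-up DataFrame rather than the actual cars data, uses the pandas groupby method to compute summary statistics of a quantitative variable within each category:

```python
import pandas as pd

# Hypothetical data, made up for illustration (not the cars data).
df = pd.DataFrame({
    'Cylinders': [4, 4, 4, 6, 6, 8, 8, 8],
    'Horsepower': [70, 85, 95, 100, 120, 150, 180, 210],
})

# Summary statistics of horsepower within each cylinder group.
summary = df.groupby('Cylinders')['Horsepower'].agg(['count', 'mean', 'min', 'max'])
print(summary)
```

Comparing the per-group mean with the per-group min and max gives a numeric counterpart to what the rows of the tick chart show: the overall trend across groups and the spread within each group.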
An alternative to a grouped tick chart is a grouped histogram. Here we encode the origin using row to create a faceted set of histograms:
Chart(cars).mark_bar().encode(
X('Horsepower', bin=Bin(maxbins=30)),
Y('count(*)', axis=Axis(ticks=5), title='N'),
Row('Origin:N')
).configure_cell(width=400, height=100)
The third case of covariation is between two categorical variables. In this case, it is common to encode the count using size or color. Here is an example that looks at the covariation between a movie's MPAA rating and its genre:
Chart(movies).mark_circle().encode(
X('Major_Genre', sort=SortField(field='Major_Genre', op='count')),
Y('MPAA_Rating', sort=SortField(field='MPAA_Rating', op='count', order='descending')),
Size('count(*)')
)
Another option is to use a heatmap to encode the counts (or another aggregation) of two categorical variables. In Altair, a heatmap can be generated using the text mark (with no text and the applyColorToBackground option) together with the row and column encodings. Here is the above example as a heatmap:
Chart(movies).mark_text(
applyColorToBackground=True,
).encode(
Column('Major_Genre',
sort=SortField(field='Major_Genre', op='count'),
axis=Axis(labelAngle=-90, labelAlign='right', orient='bottom')
),
Row('MPAA_Rating', sort=SortField(field='MPAA_Rating', op='count', order='descending')),
text=Text(value=' '),
color='count(*)'
).configure_scale(
textBandWidth=30,
bandSize=30
)
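The counts that the circle chart and heatmap encode can also be tabulated directly. As a minimal sketch, assuming a small made-up DataFrame that only borrows the column names of the movies dataset, pandas' crosstab function counts every combination of two categorical variables:

```python
import pandas as pd

# Hypothetical data, made up for illustration (not the movies data).
df = pd.DataFrame({
    'Major_Genre': ['Drama', 'Comedy', 'Drama', 'Action', 'Comedy', 'Drama'],
    'MPAA_Rating': ['R', 'PG-13', 'R', 'PG-13', 'PG', 'PG-13'],
})

# Cross-tabulate the two categorical variables: one cell per combination,
# with 0 for combinations that never occur.
table = pd.crosstab(df['MPAA_Rating'], df['Major_Genre'])
print(table)
```

Each cell of the resulting table corresponds to one circle or one heatmap cell in the charts above.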